monitor lifecycle conductor by benzekrimaha · Pull Request #2723 · scality/backbeat

benzekrimaha · 2026-03-02T16:23:04Z

bert-e · 2026-03-02T16:23:08Z

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

name	description	privileged	authored
`/after_pull_request`	Wait for the given pull request id to be merged before continuing with the current one.
`/bypass_author_approval`	Bypass the pull request author's approval	⭐
`/bypass_build_status`	Bypass the build and test status	⭐
`/bypass_commit_size`	Bypass the check on the size of the changeset `TBA`	⭐
`/bypass_incompatible_branch`	Bypass the check on the source branch prefix	⭐
`/bypass_jira_check`	Bypass the Jira issue check	⭐
`/bypass_peer_approval`	Bypass the pull request peers' approval	⭐
`/bypass_leader_approval`	Bypass the pull request leaders' approval	⭐
`/approve`	Instruct Bert-E that the author has approved the pull request.		✍️
`/create_pull_requests`	Allow the creation of integration pull requests.
`/create_integration_branches`	Allow the creation of integration branches.
`/no_octopus`	Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
`/unanimity`	Change review acceptance criteria from `one reviewer at least` to `all reviewers`
`/wait`	Instruct Bert-E not to run until further notice.

Available commands

name	description	privileged
`/help`	Print Bert-E's manual in the pull request.
`/status`	Print Bert-E's current status in the pull request `TBA`
`/clear`	Remove all comments from Bert-E from the history `TBA`
`/retry`	Re-start a fresh build `TBA`
`/build`	Re-start a fresh build `TBA`
`/force_reset`	Delete integration branches & pull requests, and restart merge process from the beginning.
`/reset`	Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e · 2026-03-02T16:23:13Z

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

codecov · 2026-03-02T16:48:01Z

Codecov Report

❌ Patch coverage is 97.47899% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.54%. Comparing base (145f7a6) to head (de1170d).
⚠️ Report is 46 commits behind head on development/9.3.

Files with missing lines	Patch %	Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js	80.00%	3 Missing ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
extensions/lifecycle/LifecycleConfigValidator.js	`100.00% <100.00%> (ø)`
extensions/lifecycle/LifecycleMetrics.js	`100.00% <100.00%> (+1.78%)`	⬆️
...tensions/lifecycle/conductor/LifecycleConductor.js	`84.40% <100.00%> (+0.69%)`	⬆️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js	`92.90% <100.00%> (+0.14%)`	⬆️
extensions/lifecycle/tasks/LifecycleTask.js	`91.63% <100.00%> (+0.08%)`	⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js	`88.99% <100.00%> (+0.10%)`	⬆️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js	`81.33% <100.00%> (+0.25%)`	⬆️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js	`92.15% <100.00%> (+0.23%)`	⬆️
lib/models/ActionQueueEntry.js	`96.29% <ø> (ø)`
...ecycle/bucketProcessor/LifecycleBucketProcessor.js	`80.83% <80.00%> (+0.96%)`	⬆️

... and 6 files with indirect coverage changes

Components	Coverage Δ
Bucket Notification	`80.37% <ø> (ø)`
Core Library	`80.57% <ø> (-0.13%)`	⬇️
Ingestion	`70.53% <ø> (-0.62%)`	⬇️
Lifecycle	`79.27% <97.47%> (+0.65%)`	⬆️
Oplog Populator	`85.83% <ø> (ø)`
Replication	`59.61% <ø> (-0.04%)`	⬇️
Bucket Scanner	`85.76% <ø> (ø)`

@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
+ Coverage            74.48%   74.54%   +0.05%     
===================================================
  Files                  200      200              
  Lines                13603    13690      +87     
===================================================
+ Hits                 10132    10205      +73     
- Misses                3461     3475      +14     
  Partials                10       10

Flag	Coverage Δ
api:retry	`9.09% <0.84%> (-0.06%)`	⬇️
api:routes	`8.91% <0.84%> (-0.06%)`	⬇️
bucket-scanner	`85.76% <ø> (ø)`
ft_test:queuepopulator	`9.13% <10.08%> (-0.89%)`	⬇️
ingestion	`12.42% <0.84%> (-0.13%)`	⬇️
lifecycle	`19.04% <67.22%> (+0.19%)`	⬆️
notification	`1.02% <0.00%> (-0.01%)`	⬇️
oplogPopulator	`0.14% <0.00%> (-0.01%)`	⬇️
replication	`18.45% <10.08%> (-0.04%)`	⬇️
unit	`51.45% <90.75%> (+0.43%)`	⬆️

delthas · 2026-03-18T16:29:03Z

        const log = this.logger.newRequestLogger();
-        const start = new Date();
+        const start = Date.now();
+        this._scanId = uuid();


Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
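
Something along these lines, as a rough sketch only. The class below is a trimmed-down stand-in, not the actual LifecycleConductor; the method bodies, the placeholder bucket list, and the use of `randomUUID` from Node's `crypto` (in place of whatever `uuid()` the PR imports) are all assumptions for illustration.

```javascript
const { randomUUID } = require('crypto');

// Hypothetical, simplified conductor: only the scan-id plumbing is shown.
class ConductorSketch {
    processBuckets() {
        const start = Date.now();
        // The scan id lives in a local variable, not on `this`.
        const scanId = randomUUID();
        const buckets = ['bucket-a', 'bucket-b']; // placeholder listing
        return this._createBucketTaskMessages(buckets, scanId, start);
    }

    // The scan context arrives as explicit arguments.
    _createBucketTaskMessages(buckets, scanId, start) {
        return buckets.map(bucket => ({
            bucket,
            contextInfo: {
                conductorScanId: scanId,
                conductorScanStartTimestamp: start,
            },
        }));
    }
}

const messages = new ConductorSketch().processBuckets();
```

With this shape, two overlapping calls cannot overwrite each other's scan id, since nothing is shared through the instance.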

Issue: BB-740

claude · 2026-06-01T12:26:45Z

tests/unit/lifecycle/LifecycleTask.spec.js:719 — Bug: test asserts contextInfo.requestId is set and reqId is undefined, but _getBucketEntryContext produces reqId. Test will fail. Either fix the test assertions or override _getBucketEntryContext in LifecycleTaskV2 to preserve the old requestId key.
extensions/lifecycle/tasks/LifecycleTask.js:140 — Behavior change: V2 continuation entries now use reqId instead of requestId in contextInfo, changing the Kafka message format for V2 continuation entries during rolling upgrades. Low impact (log correlation only) but worth confirming intentional.
extensions/lifecycle/conductor/LifecycleConductor.js:530 — scanId parameter added to listBuckets, listBucketdBuckets, and listMongodbBuckets but unused in all three — already available via this._currentScanId.

Review by Claude Code

claude · 2026-06-01T12:44:26Z

- tests/unit/lifecycle/LifecycleConductor.spec.js:325 — listBuckets stub uses 4 params (mQueue, scanId, log, cb) but method signature is (queue, log, cb) (3 params). cb is always undefined, causing TypeError. This breaks an existing test that was working before.
- tests/unit/lifecycle/LifecycleConductor.spec.js:240 — Same mismatch in new test 'should generate a conductorScanId'. The UUID assertion on scanId also fails because it receives the logger object, not a string. Use conductor._currentScanId instead.
- tests/unit/lifecycle/LifecycleConductor.spec.js:251 — Same mismatch in new test 'should not close scan metrics when throttling occurs after scan start'.

All three are the same root cause: the test author assumed scanId would be passed as a parameter to listBuckets, but the method signature was never updated. The scan ID is available via conductor._currentScanId (set before listBuckets is called).
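
A minimal illustration of the mismatch, assuming a simplified stand-in object rather than the real conductor or the real spec file: the call site passes three arguments, so a four-parameter stub sees its arguments shifted by one position.

```javascript
// Hypothetical stand-in for the conductor: listBuckets takes (queue, log, cb).
const conductor = {
    listBuckets(queue, log, cb) {
        cb(null, 0);
    },
    run(cb) {
        // The call site passes three arguments, matching the signature above.
        this.listBuckets([], { isLogger: true }, cb);
    },
};

// Stub written with four parameters, as in the failing tests.
const seen = {};
conductor.listBuckets = (mQueue, scanId, log, cb) => {
    seen.scanId = scanId; // actually receives the logger object
    seen.log = log;       // actually receives the callback
    seen.cb = cb;         // undefined: no fourth argument is passed
};

conductor.run(() => {});
```

Calling `cb(...)` inside such a stub would throw a TypeError, and a UUID check on `scanId` would be run against the logger object.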

Review by Claude Code

claude · 2026-06-01T13:58:57Z

LGTM — solid implementation of conductor scan tracking.

Reviewed: new Prometheus metrics (s3_lifecycle_latest_batch_end_time, s3_lifecycle_latest_batch_bucket_count, s3_lifecycle_bucket_processor_scan_messages_processed_total, s3_lifecycle_bucket_processor_scan_message_age_seconds), LifecycleStalledScan and LifecycleBucketProcessorMultipleParallelScans alerts, scan context propagation through conductor → bucket processor → lifecycle tasks → action entries, Grafana dashboard panels, and comprehensive test coverage.

Key design choices verified:
- conductor_scan_id label cardinality is managed by cleanup timers in prom-client; Prometheus TSDB retention is documented as an accepted tradeoff
- Rolling upgrade safety: undefined conductorScanId from old conductors is handled gracefully (try-catch in metrics, early-return guards on timestamp validation)
- Throttling-after-scan-start intentionally leaves scan state intact so LifecycleStalledScan alert can fire
- V1/V2 continuation entries correctly preserve scan context via _makeContinuationEntry while maintaining the reqId vs requestId distinction

One minor observation: BackbeatTestConsumer._expectedConductorScanId and _expectedConductorScanStartTimestamp (lines 94-95) are assigned but never read — they appear to be dead code.

Review by Claude Code

claude · 2026-06-01T14:35:09Z

extensions/lifecycle/LifecycleMetrics.js:345 — During rolling upgrades, conductorScanId is undefined for messages from old conductors, creating a conductor_scan_id="undefined" Prometheus label that could cause a spurious LifecycleBucketProcessorMultipleParallelScans alert during the upgrade window. Consider guarding with a falsy check or using a sentinel value.
extensions/lifecycle/conductor/LifecycleConductor.js:478 — Pre-existing: when listBuckets returns Throttling after scan start, _batchInProgress stays true and the new _currentScanId/_currentScanStartTimestamp are not cleared. This PR's new test codifies this behavior, but it can leave the conductor stuck until restart. Consider cleaning up scan state on all error paths.
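
For the first point, a guard could look roughly like the sketch below. The `Map`-based counter is a hypothetical stand-in for the prom-client counter, and `onScanMessageReceived` is an illustrative name, not the function in LifecycleMetrics.js.

```javascript
// Hypothetical counter stand-in; the real code uses a prom-client counter.
const counts = new Map();
function incScanCounter(labels) {
    const key = String(labels.conductor_scan_id);
    counts.set(key, (counts.get(key) || 0) + 1);
}

// Skip the per-scan metric when the message carries no usable scan id,
// so no conductor_scan_id="undefined" series is ever created.
function onScanMessageReceived(conductorScanId) {
    if (typeof conductorScanId !== 'string' || conductorScanId.length === 0) {
        return false;
    }
    incScanCounter({ conductor_scan_id: conductorScanId });
    return true;
}

onScanMessageReceived(undefined); // message from an old conductor: skipped
onScanMessageReceived('scan-1');
onScanMessageReceived('scan-1');
```

A sentinel label value would work as well, but the alert expression would then need to exclude it explicitly.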

Review by Claude Code

claude · 2026-06-01T15:21:12Z

PR #2723: monitor lifecycle conductor — Review Summary

This PR adds comprehensive lifecycle conductor scan monitoring: a scan id (UUID v4) propagated through the Kafka message pipeline, new Prometheus metrics (scan end time, bucket count, per-scan message counter, message age histogram), Grafana dashboard panels, and two new alerts (LifecycleStalledScan, LifecycleBucketProcessorMultipleParallelScans). The refactoring of context propagation via _makeContinuationEntry, _getActionContext, and _getScanContext is clean and reduces duplication across LifecycleTask/LifecycleTaskV2. Tests are thorough.

Findings:

monitoring/lifecycle/alerts.yaml:106 — LifecycleStalledScan uses or vector(0) for the end-time metric, which will trigger a false positive on fresh deployments where no scan has ever completed. The first scan's start time will always be greater than 0, satisfying the condition immediately.
extensions/lifecycle/LifecycleMetrics.js:109 — The conductor_scan_id label on the per-scan counter creates unbounded time series (one per UUID). The cleanup timer mitigates this, but resets on every message, meaning active scans never get cleaned up. A hard cap on concurrent tracked scan ids would provide a safety valve.
extensions/lifecycle/LifecycleMetrics.js:165 — configureLifecycleScanMetricRetention does not reschedule existing timers; only future timers use the new value. Minor issue since config is loaded at startup.
extensions/lifecycle/conductor/LifecycleConductor.js:476 — elapsedMs measures from a start variable that gets reassigned mid-waterfall, so error paths that short-circuit before the reassignment measure a slightly different time span than expected.
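
For the label-cardinality finding, one possible shape of the safety valve is sketched below. `ScanLabelTracker`, its constructor arguments, and the `removeLabel` callback are hypothetical names; in the real module the callback would presumably wrap the label removal on the prom-client metrics.

```javascript
// Hypothetical tracker: at most `maxScans` scan ids hold a cleanup timer.
class ScanLabelTracker {
    constructor(maxScans, retentionMs, removeLabel) {
        this.maxScans = maxScans;
        this.retentionMs = retentionMs;
        this.removeLabel = removeLabel;
        this.timers = new Map(); // iteration order = first-seen order
    }

    // Called for every message: refreshes the timer, enforcing the cap.
    touch(scanId) {
        const existing = this.timers.get(scanId);
        if (existing) {
            clearTimeout(existing);
        } else if (this.timers.size >= this.maxScans) {
            // Safety valve: evict the scan id that was first seen earliest.
            this.evict(this.timers.keys().next().value);
        }
        const timer = setTimeout(() => this.evict(scanId), this.retentionMs);
        timer.unref(); // cleanup timers should not keep the process alive
        this.timers.set(scanId, timer);
    }

    evict(scanId) {
        clearTimeout(this.timers.get(scanId));
        this.timers.delete(scanId);
        this.removeLabel(scanId);
    }
}

const removed = [];
const tracker = new ScanLabelTracker(2, 60000, id => removed.push(id));
['a', 'b', 'a', 'c'].forEach(id => tracker.touch(id));
```

In this example the third distinct id evicts `a`, because eviction follows first-seen order even though `a` was touched again. Evicting by least-recent activity instead would need a delete-and-reinsert on every touch.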

No security issues, no breaking changes to Kafka message formats (new fields are additive and handled gracefully during rolling upgrades).

Review by Claude Code

claude · 2026-06-01T15:47:12Z

LGTM — well-structured PR. The scan-context propagation through conductor → bucket-task messages → action entries is thorough and consistent. Metric cleanup timers correctly prevent unbounded prom-client memory growth. Alert PromQL for LifecycleStalledScan handles the edge cases (first scan, scan completion, vector(0) fallback). V1/V2 continuation-entry context is correctly preserved via _makeContinuationEntry with the LifecycleTaskV2 override of _getBucketEntryRequestIdContext. Test coverage is solid across metrics, conductor lifecycle, scan-context propagation, and the BackbeatTestConsumer sentinel handling.

Review by Claude Code

delthas · 2026-06-03T10:48:21Z

+                    nBucketsQueued,
+                });
+                if (scanStarted) {
+                    this._completeCurrentScan(log, totalBucketsListed);


On the error-after-start path we still call _completeCurrentScan, which sets latest_batch_end_time and latest_batch_bucket_count. Two consequences:

- A failed scan then looks "completed" to LifecycleStalledScan (the alert sees end ≥ start), so a scan that started, errored, and wedged won't trip the stalled-scan alert it's meant to catch.
- bucket_count is published with a misleading partial value.

Resetting _currentScanId/_currentScanStartTimestamp on failure is correct, but should we record the end metrics on the error path at all? Suggest resetting scan state without setting end-time/bucket-count when the scan failed.
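
Roughly what I have in mind, as a sketch under assumptions: `ScanState`, `finish`, and `metrics.onScanEnd` are made-up names standing in for the conductor state and the code that sets latest_batch_end_time / latest_batch_bucket_count.

```javascript
// Hypothetical stand-ins for the conductor state and the metrics module.
const published = [];
const metrics = {
    onScanEnd: (endTime, bucketCount) => published.push({ endTime, bucketCount }),
};

class ScanState {
    constructor() {
        this._currentScanId = null;
        this._currentScanStartTimestamp = null;
    }

    start(scanId) {
        this._currentScanId = scanId;
        this._currentScanStartTimestamp = Date.now();
    }

    // Always clears the in-memory state; publishes end metrics only on success.
    finish(err, totalBucketsListed) {
        if (!err) {
            metrics.onScanEnd(Date.now(), totalBucketsListed);
        }
        this._currentScanId = null;
        this._currentScanStartTimestamp = null;
    }
}

const state = new ScanState();
state.start('scan-1');
state.finish(new Error('listing failed'), 3); // state reset, no end metrics
state.start('scan-2');
state.finish(null, 10);                       // state reset, end metrics set
```

That way end < start still holds after a failed scan, so the stalled-scan alert can fire, and the partial count never reaches the gauge.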

delthas · 2026-06-03T10:48:24Z

+                    conductorScanId: scanId,
+                    conductorScanStartTimestamp: start,
+                });
+                LifecycleMetrics.onProcessBuckets(log, start);


This changes the meaning of an existing metric. s3_lifecycle_latest_batch_start_time used to be set at scan completion (previously onProcessBuckets was called in the success callback); it's now set at scan start.

That's a behavior change for anything already reading this metric — in particular LifecycleLateScan. After this change LateScan means "the conductor hasn't started a scan recently" and no longer catches "a scan started but hung"; that coverage moves entirely to the new LifecycleStalledScan. So the two new StalledScan rules aren't purely additive — they backfill coverage this flip removed from LateScan.

Can we confirm this is intended and call it out explicitly in the PR description / changelog so downstream dashboards and alerts are aware?

delthas · 2026-06-03T10:48:35Z

+});
+
+const bucketProcessorScanMessageAgeSeconds = ZenkoMetrics.createHistogram({
+    name: 's3_lifecycle_bucket_processor_scan_message_age_seconds',


The help text says the age is measured "when they finish processing in the bucket processor," but onBucketProcessorScanMessageReceived is called at message pickup (in _processBucketEntry, before fetching the bucket lifecycle config or scheduling the task). So this histogram actually measures "elapsed wall-time since the scan started, sampled at dequeue" — a backlog/lag signal, not processing time. Continuation slices also inherit the original scan-start timestamp, so age keeps growing across a long scan regardless of when a given slice was enqueued.

Either fix the help text to describe what's measured, or move the observation to actual task completion if processing time is what we want.

delthas · 2026-06-03T10:48:39Z

+});
+
+const bucketProcessorScanMessagesProcessed = ZenkoMetrics.createCounter({
+    name: 's3_lifecycle_bucket_processor_scan_messages_processed_total',


Naming nit: this counter is incremented at message receipt, before processing and regardless of success or object count (the JSDoc on onBucketProcessorScanMessageReceived says as much), yet it's named ..._scan_messages_processed_total and the method says ...Received. "processed" overstates what it counts. Suggest ..._scan_messages_received_total to match the semantics and the method name.

delthas · 2026-06-03T10:48:50Z

                    assert(parsedMsg.contextInfo?.reqId, 'expected contextInfo.reqId field');
-                    parsedMsg.contextInfo.reqId = expectedMsg.value.contextInfo?.reqId;
+                    expectedValue.contextInfo.reqId = parsedMsg.contextInfo.reqId;
+                    if (expectedValue.contextInfo?.conductorScanId === 'test-scan-id') {


Building on François's earlier point about this utility: this doesn't actually test scan-id propagation. It overwrites the expected conductorScanId/conductorScanStartTimestamp with whatever the actual message contained, and never asserts the fields are present — so a message that omitted them entirely would still pass deepStrictEqual. Unlike the reqId case just above, there's no assert(parsedMsg.contextInfo?.conductorScanId, ...) guarding presence.

Either add a presence assertion (mirroring the reqId check) or move the scan-id assertion back into the test rather than the shared consumer.
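
For the first option, a sketch of what the consumer-side check could look like. `normalizeScanContext` is a hypothetical helper name, and requiring the timestamp to be a finite number is my assumption about the message format.

```javascript
const nodeAssert = require('assert');

// Hypothetical helper: assert presence first, then copy the run-specific
// values into the expected message so deepStrictEqual can compare the rest.
function normalizeScanContext(parsedMsg, expectedValue) {
    const actual = parsedMsg.contextInfo || {};
    nodeAssert(actual.conductorScanId,
        'expected contextInfo.conductorScanId field');
    nodeAssert(Number.isFinite(actual.conductorScanStartTimestamp),
        'expected contextInfo.conductorScanStartTimestamp field');
    return {
        ...expectedValue,
        contextInfo: {
            ...expectedValue.contextInfo,
            conductorScanId: actual.conductorScanId,
            conductorScanStartTimestamp: actual.conductorScanStartTimestamp,
        },
    };
}
```

A message that omits either field now fails the assertion instead of silently matching.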

delthas · 2026-06-03T10:48:54Z

    processActionEntry(entry, done) {
        const startTime = Date.now();
        const log = this.logger.newRequestLogger();
+        const conductorScanId = entry.getContextAttribute('conductorScanId');


Consistency: there are now three different ways scan context gets onto the logger across the action tasks. LifecycleUpdateExpirationTask does log.addDefaultFields(entry.getLogInfo()), while here and in LifecycleUpdateTransitionTask we manually getContextAttribute('conductorScanId') + getContextAttribute('conductorScanStartTimestamp') + addDefaultFields.

Since ActionQueueEntry._loggedAttributes now includes both scan fields, the manual extraction is redundant — these two could use entry.getLogInfo() too (this is the direction François asked for earlier). Standardizing on one approach would be cleaner.

delthas · 2026-06-03T10:49:02Z

                    }
                },
                "concurrency": 10,
+                "scanMetricRetentionS": 86400,


The retention default now lives in three places: 86400 here, and DEFAULT_SCAN_METRIC_RETENTION_S = 24 * 60 * 60 duplicated in both LifecycleMetrics.js and LifecycleConfigValidator.js. Suggest a single exported constant (re-used by the validator's .default(...)) so these can't drift apart.

delthas · 2026-06-03T10:49:06Z

    });

    describe('_indexesGetOrCreate', () => {
+        it('should include conductor scan id in task context', () => {


This test exercises _taskToMessage, but it's placed under describe('_indexesGetOrCreate'). Minor, but worth moving it to a block that matches what it covers (or its own describe('_taskToMessage')).

delthas · 2026-06-03T10:49:12Z

+    }
+
+    const ageSeconds = (Date.now() - conductorScanStartTimestamp) / 1000;
+    if (ageSeconds >= 0) {


The ageSeconds >= 0 guard drops the observation entirely when the computed age is negative, rather than clamping it to 0. Since the age is a cross-host subtraction (Date.now() here minus the conductor's Date.now() carried in the message), small negative values are expected and dropping them silently removes the fastest samples, biasing the histogram upward. Suggest observe(..., Math.max(0, ageSeconds)) so those samples still land in the lowest bucket.
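
A small sketch of the clamped version. The array-backed `histogram` is a stand-in for the prom-client histogram, and `observeScanMessageAge` with its injectable `now` parameter is a hypothetical shape chosen to make the example deterministic.

```javascript
// Hypothetical histogram stand-in; the real code observes a prom-client histogram.
const observed = [];
const histogram = { observe: value => observed.push(value) };

function observeScanMessageAge(conductorScanStartTimestamp, now = Date.now()) {
    if (!Number.isFinite(conductorScanStartTimestamp)) {
        return; // no usable timestamp, e.g. message from an old conductor
    }
    const ageSeconds = (now - conductorScanStartTimestamp) / 1000;
    // Clock skew can make the age slightly negative: clamp instead of dropping.
    histogram.observe(Math.max(0, ageSeconds));
}

observeScanMessageAge(10000, 9950);      // 50 ms of skew: recorded as 0
observeScanMessageAge(10000, 12500);     // recorded as 2.5
observeScanMessageAge(undefined, 12500); // skipped
```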

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03

benzekrimaha marked this pull request as ready for review March 11, 2026 16:35

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48

benzekrimaha requested review from a team, SylvainSenechal and francoisferrand March 13, 2026 08:49

francoisferrand requested a review from delthas March 18, 2026 09:19

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04

benzekrimaha changed the title ~~Improvement/bb 740 monitor lifecycle conductor~~ Improvement/BB-740 monitor lifecycle conductor Mar 18, 2026

francoisferrand changed the title ~~Improvement/BB-740 monitor lifecycle conductor~~ monitor lifecycle conductor Mar 18, 2026

delthas reviewed Mar 18, 2026

claude Bot reviewed Mar 19, 2026

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated

claude Bot reviewed Mar 19, 2026

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated

claude Bot reviewed Mar 19, 2026

Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea Compare March 19, 2026 09:36

claude Bot reviewed Mar 19, 2026

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated

claude Bot reviewed Mar 19, 2026

Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated

claude Bot reviewed Mar 19, 2026

Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 Compare March 19, 2026 09:43

claude Bot reviewed Mar 19, 2026

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated

claude Bot reviewed Mar 19, 2026

Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated

claude Bot reviewed Mar 19, 2026

Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated

fixup! Refine lifecycle scan alerts.

0a2db86

Issue: BB-740